AITopics | sgd algorithm

How AI settled the complexity of the oldest SGD algorithm

Dereziński, Michał, Dong, Xiaoyu

arXiv.org Machine Learning, Jun-30-2026

An essential catalyst for the remarkable breakthroughs in AI that led to the modern large language models (LLMs) such as ChatGPT and Gemini has been the algorithms used to train these models on massive datasets. While the LLM architectures have gotten progressively more complex, the training algorithms have stayed relatively simple, and in fact, they have all been based on the decades-old paradigm of stochastic gradient descent (SGD). The key idea behind SGD is that in order to minimize a certain objective function (such as an LLM's error on the training data), it suffices to access only a noisy estimate of that objective at any given time (e.g., based on a small sample of the data) while making incremental progress towards the solution. This is essential for LLM training, as the datasets have become so massive that one could not hope to perform computations on everything all at once. Commonly attributed to a 1951 paper by Robbins and Monro [34], SGD has seen a resurgence of interest over the last 20 years by AI researchers and computer scientists striving to understand its effectiveness, leading to numerous variants and extensions used in modern LLMs [12, 9], most notably the Adam algorithm [25]. As a result, we have gained a robust mathematical understanding of the computational complexity of SGD algorithms in a wide range of settings (e.g., see [11, 15, 5, 17]). Yet, despite this progress, there is a surprising gap in the understanding of SGD: the complexity of an algorithm proposed by Stefan Kaczmarz in 1937 [24] for solving a system of linear equations (the oldest published example of an SGD algorithm, which predates Robbins and Monro's paper by over a decade) has not been settled.
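The Kaczmarz iteration mentioned in this abstract can be sketched in a few lines. This is a minimal illustration rather than the paper's analysis: it assumes a consistent system and a cyclic row order, and the function name `kaczmarz`, the `sweeps` parameter, and the 2x2 example system are hypothetical choices made here. Each step projects the current iterate onto the hyperplane of a single equation, which coincides with an SGD step on the squared residual of that one equation with step size 1/||a_i||^2.

```python
def kaczmarz(A, b, sweeps=200):
    """Cyclic Kaczmarz iteration for a consistent linear system A x = b.

    A is a list of rows (lists of floats), b a list of floats.
    The cyclic row order and the fixed sweep count are illustrative choices.
    """
    x = [0.0] * len(A[0])
    for _ in range(sweeps):
        for row, rhs in zip(A, b):
            norm_sq = sum(a * a for a in row)
            if norm_sq == 0.0:
                continue  # an all-zero row carries no information
            # Residual of this single equation at the current iterate.
            residual = rhs - sum(a * xi for a, xi in zip(row, x))
            # Project x onto the hyperplane {z : row . z = rhs}.
            step = residual / norm_sq
            x = [xi + step * a for xi, a in zip(x, row)]
    return x


# Hypothetical example: 2x + y = 5 and x + 3y = 10, whose solution is (1, 3).
A = [[2.0, 1.0], [1.0, 3.0]]
b = [5.0, 10.0]
x = kaczmarz(A, b)
```

Only one row of the system is touched per update, which is the sense in which the method works from a partial view of the objective at each step.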

large language model, machine learning, natural language, (22 more...)

2606.29593

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms

Neural Information Processing Systems, Apr-30-2026, 05:48:15 GMT

Stochastic gradient descent (SGD) algorithm is the method of choice in many machine learning tasks thanks to its scalability and efficiency in dealing with large-scale problems. In this paper, we focus on the shuffling version of SGD which matches the mainstream practical heuristics. We show the convergence to a global solution of shuffling SGD for a class of non-convex functions under over-parameterized settings. Our analysis employs more relaxed non-convex assumptions than previous literature. Nevertheless, we maintain the desired computational complexity as shuffling SGD has achieved in the general convex setting.

1 Introduction

In the last decade, neural network-based models have shown great success in many machine learning applications such as natural language processing [Collobert and Weston, 2008, Goldberg et al., 2018], computer vision and pattern recognition [Goodfellow et al., 2014, He and Sun, 2015].
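The shuffling version of SGD described in this abstract draws a fresh permutation of the data each epoch and visits every sample exactly once, instead of sampling with replacement. The following is a minimal sketch under assumptions made here, not the paper's setting: it uses a convex least-squares objective on a tiny dataset that can be fit exactly (a toy stand-in for over-parameterization), and the names `shuffling_sgd`, `lr`, `epochs`, and `seed` are hypothetical.

```python
import random


def shuffling_sgd(samples, lr=0.1, epochs=300, seed=0):
    """Random-reshuffling SGD on f(w) = (1/2n) * sum_i (w . x_i - y_i)^2.

    samples is a list of (x, y) pairs, with x a list of floats.
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # fresh permutation at the start of each epoch
        for i in order:     # each sample is used exactly once per epoch
            x, y = samples[i]
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            # Gradient of the single-sample loss is err * x.
            w = [wj - lr * err * xj for wj, xj in zip(w, x)]
    return w


# Hypothetical data that w = (2, -1) fits exactly (zero training loss exists).
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w = shuffling_sgd(samples)
```

Because a zero-loss solution exists in this toy problem, a constant step size suffices for the iterates to approach it; with noisy, inconsistent data a decaying step size would typically be needed.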

artificial intelligence, convergence, machine learning, (16 more...)

Country:

North America > United States (0.46)
North America > Canada (0.28)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

65ccdfe02045fa0b823c5fa7ffd56b66-Paper-Conference.pdf

Neural Information Processing Systems, Apr-26-2026, 12:59:40 GMT

artificial intelligence, data mining, machine learning, (18 more...)

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.98)
Information Technology > Data Science > Data Mining (0.93)

SEBOOST - Boosting Stochastic Learning Using Subspace Optimization Techniques

Elad Richardson, Rom Herskovitz, Boris Ginsburg, Michael Zibulevsky

Neural Information Processing Systems, Apr-21-2026, 15:28:33 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, artificial intelligence, machine learning, (18 more...)

Country: Europe (0.28)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.71)

On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms

Neural Information Processing Systems, Feb-17-2026, 20:41:10 GMT

While there has been much attention on the theoretical aspect of the traditional i.i.d.

artificial intelligence, convergence, machine learning, (16 more...)

Country:

North America > United States > California (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
(6 more...)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Data-driven Optimal Filtering for Linear Systems with Unknown Noise Covariances

Neural Information Processing Systems, Feb-17-2026, 11:36:05 GMT

This paper examines learning the optimal filtering policy, known as the Kalman gain, for a linear system with unknown noise covariance matrices using noisy output data. The learning problem is formulated as a stochastic policy optimization problem, aiming to minimize the output prediction error. This formulation provides a direct bridge between data-driven optimal control and its dual, optimal filtering.

artificial intelligence, machine learning, survey article, (20 more...)

Country:

North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Overview (0.65)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)

e64c9ec33f19c7de745bd6b6d1a7a86e-Paper.pdf

Neural Information Processing Systems, Feb-11-2026, 16:02:40 GMT

algorithm, sco problem, sgd, (15 more...)

Country: Europe > Czechia > Prague (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Efficiency Ordering of Stochastic Gradient Descent

Neural Information Processing Systems, Dec-24-2025, 08:57:14 GMT

We consider the stochastic gradient descent (SGD) algorithm driven by a general stochastic sequence, including i.i.d. noise and random walk on an arbitrary graph, among others, and analyze it in the asymptotic sense. Specifically, we employ the notion of 'efficiency ordering', a well-analyzed tool for comparing the performance of Markov Chain Monte Carlo (MCMC) samplers, for SGD algorithms in the form of Loewner ordering of covariance matrices associated with the scaled iterate errors in the long term. Using this ordering, we show that input sequences that are more efficient for MCMC sampling also lead to smaller covariance of the errors for SGD algorithms in the limit. This also suggests that an arbitrarily weighted MSE of SGD iterates in the limit becomes smaller when driven by more efficient chains. Our finding is of particular interest in applications such as decentralized optimization and swarm learning, where SGD is implemented in a random walk fashion on the underlying communication graph for cost issues and/or data privacy. We demonstrate how certain non-Markovian processes, for which typical mixing-time based non-asymptotic bounds are intractable, can outperform their Markovian counterparts in the sense of efficiency ordering for SGD. We show the utility of our method by applying it to gradient descent with shuffling and mini-batch gradient descent, reaffirming key results from existing literature under a unified framework. Empirically, we also observe efficiency ordering for variants of SGD such as accelerated SGD and Adam, opening up the possibility of extending our notion of efficiency ordering to a broader family of stochastic optimization algorithms.
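To make "SGD driven by a general stochastic sequence" concrete, the sketch below runs the same SGD update with two different index streams: i.i.d. uniform draws, and a simple random walk on a ring graph. It is a toy illustration under assumptions made here and does not compute the Loewner ordering studied in the paper; the scalar objective, the 1/t step size, and the names `iid_indices`, `ring_walk_indices`, and `sgd_scalar` are all hypothetical.

```python
import random


def iid_indices(n, rng):
    """Yield indices drawn independently and uniformly from range(n)."""
    while True:
        yield rng.randrange(n)


def ring_walk_indices(n, rng, start=0):
    """Yield the nodes visited by a simple random walk on an n-node ring."""
    i = start
    while True:
        yield i
        i = (i + rng.choice((-1, 1))) % n  # move to a uniformly chosen neighbor


def sgd_scalar(targets, index_stream, steps):
    """SGD with step size 1/t on f(w) = (1/2n) * sum_i (w - targets[i])^2."""
    w = 0.0
    for t, i in zip(range(1, steps + 1), index_stream):
        # Gradient of the sampled term is (w - targets[i]).
        w -= (w - targets[i]) / t
    return w


# Hypothetical data held at the five nodes of a ring; the minimizer is the mean, 2.0.
targets = [0.0, 1.0, 2.0, 3.0, 4.0]
steps = 20000
w_iid = sgd_scalar(targets, iid_indices(len(targets), random.Random(0)), steps)
w_walk = sgd_scalar(targets, ring_walk_indices(len(targets), random.Random(1)), steps)
```

Both runs end near the minimizer. Repeating each over many seeds and comparing the variance of the final error is one rough way to see how the choice of driving sequence affects the long-run error.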

efficiency ordering, name change, stochastic gradient descent, (7 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
